ABSTRACT

The New York State Department of Civil Service has
investigated an empirical application of the Rasch model, which
appears useful in Civil Service testing. The model is a powerful tool
for developing insights into what test items are measuring while
permitting the investigator to spot defective items. It also points up
meaningful distinctions in the type of task set by different items.
Two sets of specific examples are discussed to illustrate its
usefulness. The first set of examples considers the use of item
probability as an index of the degree of fit of the material to the
model, while the second discusses the "normal deviate" matrix, which
displays the goodness of fit of each item at each score group and
enables an investigator to ascertain the overall validity of a
general index. These examples demonstrate the applicability of the
Rasch model to a variety of conditions. The author suggests that the
model seems promising for civil service testing since it is not
simply a means to derive scores but is also a powerful tool for test
analysis, construction, and design.


Introduction



At the New York State Department of Civil Service we have been working
exclusively with only one of the models¹ developed by Rasch. Our interest has
been focused on the model discussed by Wright² at the 1967 ETS Invitational
Testing Conference. This particular model is occasionally referred to as the
log-odds model and basically treats the responses to a test item in terms of
dichotomies such as right and wrong.
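As a brief illustration of the log-odds form: in the usual statement of the dichotomous Rasch model, the log-odds of a right answer equals the difference between a person's ability and the item's difficulty. The sketch below assumes that standard form; the names `ability`, `difficulty`, `rasch_probability`, and `log_odds` are illustrative and are not taken from any particular program.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a right answer under the dichotomous Rasch model."""
    # The log-odds of a right answer is (ability - difficulty), both in logits,
    # so the probability is the logistic function of that difference.
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p: float) -> float:
    """Log-odds (logit) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))
```

When ability and difficulty are equal the probability of a right answer is one half, and taking the log-odds of any probability returned by `rasch_probability` recovers the difference between the two parameters.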

Our testing program places a heavy emphasis upon the multiple-choice type
of written test item. In general we use 4 or 5 choice items which are scored as
right or wrong. These test items are generally grouped into sets of 15 or 20
questions, which we call subtests, which are designed to measure some particular
area. We have analyzed a variety of our subtests with the Rasch Model since we
began working with it. We have analyzed subtests in:



Abstract reasoning
Reading comprehension
Supervision
Economics
Vocabulary
Spelling
Spatial relations
Quantitative reasoning
Statistics
Administrative judgment
Budgeting
Interviewing techniques
Report writing



to name several. Some of our subtests might be called aptitude tests and others
achievement tests by educators. Unfortunately the public personnel selection
community and the educational community occasionally use different labels for
similar ideas. As a result I will try to limit my use of labels in order to
avoid misunderstanding.

1. Rasch, G. Probabilistic models for some intelligence and attainment tests.
Copenhagen: Danish Institute for Educational Research, 1960.



Objective

In our empirical work with the Rasch Model we have been exploring a
variety of issues simply because we understand very few of them and have virtually
no recourse to published works since they are relatively scarce.

One of the issues we have been exploring is one that I would like to relate
to you today. It is an empirical application of the Model that has aroused our
enthusiasm, appears to be particularly useful for civil service testing and
nevertheless is one that we have not read much about in the literature.

Our empirical investigation of the Rasch Model on test items, measuring a
wide variety of areas, has led us to believe that the Rasch Model may be a powerful
tool for developing insights into what test items are measuring. We believe further
that this information can then be used to make practical decisions about item
construction and/or test design.

Specifically our experience has shown us that the Rasch Model permits us
to

1. spot defective items, in the traditional sense such as items
without good key answers; and

2. spot items which do not belong in a subtest, in the sense for
example, of items which are functioning as reading comprehension
items or quantitative reasoning items in subject matter
tests.

We have found repeated instances of the model quickly and clearly pointing
up meaningful distinctions in the type of task set by different items.

It would be unreasonable for me to expect to convince you, within the
limited span of time I have today, of this general applicability of the Rasch
Model which we have repeatedly found and believe may be an important
aspect of the Rasch Model.



2. Wright, B. "Sample-free test calibration and person measurement." In
Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton:
Educational Testing Service, 1968, 85-101.

What I propose to do in the time remaining however, is to discuss various
specific examples of our work which have sparked our enthusiasm, and helped
formulate our current view, in the belief that, for some of you, this may be a
useful application, and in the hope that you may become sufficiently interested
to explore these aspects more carefully and systematically with your own material,
in order to nail down this particular application, if in fact it is a real one.



[...] of those misunderstandings that occasionally occur between the public personnel
selection sector and the educational community. At the first AERA presession on
the Rasch Model conducted by Dr. Wright a few years ago in California, the
participants were encouraged to bring data with them to be analyzed at the
presession.

One of our colleagues from the United States Civil Service brought along
data on one of their widely used tests. His data was in alphabetic form, that is
the alternatives to test items were A, B, C, etc., rather than numeric or 1, 2, 3,
etc. Unfortunately, the programs used for the analysis at UCLA for the presession
were designed to accept only numeric data. (Parenthetically I might note that
the data I brought to the same session suffered the same fate. In fact the data
brought by those in the educational community were only numeric while those from
public personnel agencies brought only alphabetic data and neither thought to
question this point until it was too late.) Fortunately, we had a working computer
program at the New York State Department of Civil Service which could handle both
alphabetic and numeric data and we agreed to analyze the U. S. Civil Service data
when we returned.
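To make the alphabetic versus numeric point concrete, here is a minimal sketch of a scoring step that accepts either form and reduces each response to right (1) or wrong (0). It is only an illustration under stated assumptions: the function name `score_responses`, the treatment of blanks as wrong, and the mapping A = 1, B = 2, and so on are all hypothetical and do not describe the program actually used.

```python
def score_responses(responses, key):
    """Score one candidate's answers as right (1) or wrong (0).

    Alternatives may be alphabetic (A, B, C, ...) or numeric (1, 2, 3, ...);
    both the responses and the key are converted to numbers before comparing.
    """
    def to_number(alternative):
        text = str(alternative).strip().upper()
        if text.isascii() and text.isdigit():
            return int(text)
        if len(text) == 1 and "A" <= text <= "Z":
            return ord(text) - ord("A") + 1  # A -> 1, B -> 2, ...
        return None  # omitted or unreadable response

    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")

    scored = []
    for response, answer in zip(responses, key):
        r, k = to_number(response), to_number(answer)
        # An omitted or unreadable response is counted as wrong (an assumption).
        scored.append(1 if r is not None and r == k else 0)
    return scored
```

With this kind of step in front of the analysis, data recorded as A, B, C and data recorded as 1, 2, 3 produce the same matrix of dichotomous scores.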



Item Probability



Example #1:

Their data was selected by them for analysis because it had been reviewed
over the years by traditional analysis and it was felt that this set of items
could reasonably be expected to fit the assumptions of the Rasch model and
therefore could serve as a useful means of gaining a better understanding of the Model.

The data consisted of the responses of 9flV middle managers from various
agencies in the Federal Government to a set of 25 Reading Comprehension items.

We submitted this data to an analysis by the Rasch Model and examined one of the
indices which appears as a routine output on our program and which is used to
determine the degree of fit of the material to the model. The index used is
called the "item probability" and is the probability of the observations, given
the item fits the model. For each test item an "item probability" is calculated.
Glancing at the values of the "item probabilities" for this data we noticed several
items whose "item probabilities" were extremely low in comparison to the other items.
In particular we noticed four items with "item probabilities" of 0.000.
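The exact computation behind the "item probability" is not spelled out here. One plausible reading, offered only as an assumption, is an upper-tail chi-square probability for the sum of an item's squared standardized deviates across score groups. The sketch below follows that reading; the function names and the choice of degrees of freedom (one per score group) are hypothetical, and an actual program may well adjust the degrees of freedom or define the index differently.

```python
import math


def chi_square_upper_tail(x: float, df: int) -> float:
    """P(chi-square with df degrees of freedom >= x), for integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    y = x / 2.0
    # Closed forms for the two starting cases: df = 1 and df = 2.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # Step up two degrees of freedom at a time using
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while 2 * a < df:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def item_probability(normal_deviates):
    """Upper-tail probability of the summed squared deviates for one item.

    Assumption: one degree of freedom per score group.
    """
    chi_square = sum(z * z for z in normal_deviates)
    return chi_square_upper_tail(chi_square, df=len(normal_deviates))
```

Under this reading, an item whose deviates are large in several score groups yields a probability that rounds to 0.000, which is the kind of value that drew our attention.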

So, we looked at the content of all the items and found that the items could
be answered correctly based solely on the information presented in the stem, except
for the four items with the low "item probabilities". These four items required
the candidate to bring additional knowledge to the item in order to correctly
select the key answer, a requirement that was not present in the other items.

Next, we looked at the "item probabilities" of all 25 items again and
noticed three items with relatively high "item probabilities". A look at the
content of these items revealed a dramatic shift in the vocabulary level in these
items, as compared to the other 22 items. These three items contained phrases
such as "organized acquisitive procedure for self maintenance" as a definition of
war; and "non-pecuniary incentive."

The use of the "item probabilities" in this way, for this set of data, seemed
to enable us to spot items that were functioning differently than the total set of
items and we seemed to be able to find reasonable explanations of the nature of
the functioning differences.

Example #2:

In another instance, we examined a set of test items that we developed as
part of a promotion examination for second level professional research positions.
We call this examination the Research Services series. For one of these positions
we developed a set of 30 questions in the area of social research. The items were
all designed to measure the same general area. For purposes of analysis the 30
questions were treated as two blocks of 15 items. Each block of 15 items was
subjected to a Rasch analysis (N = 261) and both fit the model (p = 0.h7k> and p = 0.1jl6
respectively).

For the first set of 15 items we again looked at the "item probabilities".
Four items had "item probabilities" that were relatively high in comparison to the
rest of the items. These four items had probabilities greater than 0.6!; while the
remaining eleven items were all below 0[...]. In keeping with our prior experience
we hypothesized that the four items with the relatively high "item probabilities"
had some kind of common feature that could explain their similar extreme
probabilities. After examining the content of each of the items it appeared as if we
could identify a concept that set these four items apart. These four items were
unique from the rest of the subtest in that those items seemed to require a person
to be aware of the practical considerations as well as the theoretical and ideal
approaches to research. The other 11 items did not have this requirement.

To explore this hypothesis further we looked at the second set of items
and found two items which seemed to embody the same concept that we found in the
four items of the first set. A check of their "item probabilities" showed that
these two items had the lowest probabilities of fit on the entire subtest, that is,
these two items were deviant from the rest of the set of items in their "item
probabilities" and in a similar manner. Thus the Rasch analysis seemed to indicate
items with extreme "item probabilities" embodying a common concept.



Normal Deviate

There is an alternative method of evaluating the fit of items which involves
an examination of the "normal deviate" matrix. This matrix displays the
goodness-of-fit of each item at each score group. It enables us to look for patterns as
well as locate the precise point of poor fit if it exists. This often helps us to
determine if a general index of poor fit is important to consider or not. For
example, if the overall fit of the item is poor as reflected in a general index,
but an examination of the "normal deviate" matrix shows us that the fit is poor
only at one score group and if that score group is an unusually small one, we are
likely to overlook the general index or at least not pay too much attention to it.
The "normal deviate" matrix possesses a general index of fit too, called the "mean
square fit". The "mean square fit" and the "item probability" are related to each
other and either can be used. Some of the examiners in our agency prefer one,
while others prefer the other. While either can be used, the personal preferences
of the examiners performing the analysis are factors to be considered, especially
if one has to work with them on a daily basis. In order to preserve a degree of
harmony among my colleagues I intend to now give "equal time" to the "mean square
fit" advocates.
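For readers who want to see how such a matrix might be assembled, the following is a minimal sketch and not a description of our program. It assumes each cell is the standardized difference between the observed and the model-expected number of right answers for one item in one score group, and that the "mean square fit" is the average of an item's squared deviates over score groups. All names, and the use of a single ability value per score group, are illustrative assumptions.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a right answer under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def normal_deviate_matrix(group_sizes, group_abilities,
                          item_difficulties, observed_right):
    """Rows are score groups, columns are items.

    observed_right[g][i] is the number of candidates in score group g who
    answered item i right. Every score group is assumed to be non-empty.
    """
    matrix = []
    for n, ability, counts in zip(group_sizes, group_abilities, observed_right):
        row = []
        for difficulty, right in zip(item_difficulties, counts):
            p = rasch_probability(ability, difficulty)
            expected = n * p                       # expected number right
            spread = math.sqrt(n * p * (1.0 - p))  # binomial standard deviation
            row.append((right - expected) / spread)
        matrix.append(row)
    return matrix


def mean_square_fit(matrix):
    """Average squared deviate for each item (one value per column)."""
    n_groups = len(matrix)
    n_items = len(matrix[0])
    return [sum(matrix[g][i] ** 2 for g in range(n_groups)) / n_groups
            for i in range(n_items)]
```

Reading down a column of the matrix shows where an item misfits: one large deviate in a small score group matters less than large deviates across several groups, even when the two cases give a similar general index.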

Example #1:

You may recall that the last examination I was discussing was the Research
Services examination, which was a promotion examination for second level
professional research positions. Another part of that examination was a subtest on
"tabular interpretation". For this subtest, the candidate is required to recognize
relationships in and draw inferences from data presented in tabular form. The
items stress the ability to deal with relationships among the various categories
in the tables but minimize the need to perform computations to arrive at the
answer.

For this material the analysis with the Rasch Model was a bit more involved.
We have three alternate forms of the Research Services examination: Forms A, B, & C.
Each form contains a subtest on "tabular interpretation". We use a different
15-item "tabular interpretation" subtest on each Form; however, each subtest contains
some items common to at least one of the other subtests.

We analyzed the subtest from Form A (n = 2?£) with the Rasch Model. One
item possessed an extreme "mean square fit" of 3.2fj> as compared to the rest of the
items which had "mean square fits" of less than 2.00.

We then examined the item and seemed to find:

1. the item required a successful manipulation of a series of
complex relationships which was not required by the other
items, and

2. one of the wrong answers could be arrived at by performing
all but the final operation.

We felt that perhaps many of the better candidates understood the item but
were omitting the final operation and therefore getting the item wrong.

Fortunately, we were in a position to evaluate the tentative conclusions.
Form B, an alternate Form for this examination, also contained this item, which for
purposes of our discussion I will call item X. In Form B, a different item
preceded item X, and this new item could be answered correctly by performing all
but the final operation required for item X. It was expected that the better
candidates would get this new item correct and then would realize that the answer
to item X simply required an additional operation. Rasch analysis of Form B
(n = 188) supported our expectation. Both the new item and item X had "mean square
fits" slightly below 1.00.

In examining the "normal deviate" matrix for Form B we also found a single
item with an extremely high "mean square fit" (MSF = 7.8?). A review of all the
items revealed that this particular item was unique in that the candidate was
required to deal with an indeterminacy to arrive at the correct answer. In the
other items the answer could be found by simply filling in all missing entries
in the table. It was felt that manipulating all possible combinations through
an indeterminacy may require a higher level of reasoning ability than simply filling
in missing entries. This item was also included in Form C (n = 2 J 4 S) and again
was the only item with a high "mean square fit" (MSF = ^.93). The Rasch analysis
seemed to again pinpoint the item that was functioning in a manner different from
the rest of the subtest.

Example #2:

I would like to discuss one final example. We have not limited our 
investigations to tests for positions that require a college education as the 
previous examples might lead you to believe. For example, let us look at an 
examination for Clerks. As part of our promotion examination to second level 
clerical positions we include a subtest of 15 items on vocabulary. The items in 
this subtest present five alternatives from which the candidate is to select the 
one most similar in meaning to the word given in the stem of the item. 

We subjected the 15 vocabulary items to a Rasch analysis (n = 3>6?u) and
found one item with a high "mean square fit" of 1?.?6. An inspection of this item
revealed:

1. for the word presented in the stem, it was possible to derive
an entirely different word by changing only one or two letters,
and

2. the meaning of this new word was one of the incorrect alternatives
that could be chosen by the candidate.

In another vocabulary subtest for an alternate form of this examination
we found two items with similar properties. A Rasch Analysis of the data
(n = 3,998) showed that these two items also had high "mean square fits".

Summary

There are many other examples that I could present to you but I believe
the ones that have been presented here today give you an idea of the application
that we see as one of the most promising for civil service testing. I discussed
items in reading comprehension, social research, tabular interpretation, and
vocabulary. I have attempted to present items from a variety of areas to give
you an idea of the general applicability we have found.

I discussed examples of single subtests, multiple subtests with common
overlapping items, as well as separate subtests with identical item construction
concepts. I have tried, in this way, to illustrate the persistence of our finding
across a variety of conditions.

As for our subjects, they were both college as well as non-college personnel.
They included professional technical specialists, professional managers or
generalists, and clerical or non-professional personnel. In this way I have
attempted to give you an idea of the wide subject levels that we find succumb
to the Rasch Analysis.

Within this brief period that I have had with you I have attempted to
illustrate that, despite wide variations of item content and subject composition,
the Rasch analysis produced persistent useful results.

Our feeling about the applicability of the Rasch Model to civil service
testing perhaps is best summarized by a comment made by Dr. Albert P. Maslow,
Chief of the Personnel Measurement Research and Development Center of the United
States Civil Service, in reviewing an article on the Rasch Model³ to be published
in an upcoming issue of the Public Personnel Review:

"I believe that a major value of the Rasch Model may prove not to be
simply as a means to derive scores, but as a powerful tool for developing insight
about what it is that test items are measuring, how the items in a test relate to
one another, and how these insights can apply to the improvement of test design,
item construction, and test analysis."

3. Durovic, J. "Improving Written Test Measurement in Personnel Selection." Public
Personnel Review, in print.






